Are Nonsignificant Differences Really Not Significant?
Abstract
Many researchers and reviewers hesitate to publish data when treatments are not “significantly” different. Most of us were taught to use a particular probability value (usually 5% or 1%) as a critical value to test hypotheses. The term “significant” has come to be synonymous with the phrase “significant at the 5% level.” Using relatively conservative critical values prevents readers from evaluating, or even seeing, some data that should be published. When using 5% as a critical value, we would conclude that two treatments are different when the P-value is 0.05, but not when it is 0.051. Reviewers often suggest that data not differing at the 5% level should not be included in a publication, but that the author should state that the difference was not significant. Is this an intelligent use of statistical tests?

Researchers use such tests to help interpret experimental data. Probability values allow us to determine the probability of incorrectly declaring that responses are different when the difference is actually due to chance. The use of critical levels to test hypotheses has led to a black-or-white (reject vs. fail to reject) view of data when the real world is mostly gray.

When a probability value of 5% is used as a critical value, there is a 5% chance that a true null hypothesis will be incorrectly rejected. In other words, 5% of the time responses are declared different when the difference is really due to chance rather than to treatment. This type of mistake is known as a Type I error, and its probability is α. Most researchers do not consider Type II error, which is incorrectly concluding that treatment differences are due to chance rather than to experimental treatments. The probability of a Type II error, denoted as β, is determined by the choice of α and the difference between two treatment means. Power (1 – β) is the ability to detect differences that are due to treatment.
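How β depends on α and on the difference between two treatment means can be sketched numerically. The snippet below is a minimal illustration, not a method from the article: it assumes a two-sided, two-sample z-test with a known standard deviation and equal replication, and the function name `type_ii_error` and all input values (difference, standard deviation, replicates) are hypothetical.

```python
from statistics import NormalDist

def type_ii_error(alpha, diff, sd, n):
    """Approximate beta for a two-sided, two-sample z-test.

    Assumptions (hypothetical, for illustration only): the standard
    deviation `sd` is known and equal in both treatments, and each
    treatment has `n` replicates.
    """
    z = NormalDist()                       # standard normal distribution
    se = sd * (2.0 / n) ** 0.5             # standard error of the difference between two means
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value for the chosen alpha
    shift = abs(diff) / se                 # true difference expressed in standard-error units
    # beta = probability that the test statistic lands inside the
    # "fail to reject" region even though a real difference exists
    return z.cdf(z_crit - shift) - z.cdf(-z_crit - shift)

# Hypothetical inputs: true difference of 15 units, sd of 20 units, 10 replicates per treatment
for alpha in (0.01, 0.05, 0.10, 0.20):
    beta = type_ii_error(alpha, diff=15.0, sd=20.0, n=10)
    print(f"alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

With the sample size and true difference held fixed, β falls as α is made more liberal, so power rises. When the true difference is zero, the function returns 1 – α, the probability of correctly failing to reject.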
Ideally we would like to have a low probability of making both types of errors, but Type I and Type II errors are inversely related, and both decrease only as sample size increases. The concept of Type II error is especially important in determining the sample size needed to detect a difference of a specified magnitude. When both types of error are considered in the case of a fixed sample size, a reduction in Type I error must be accompanied by an increase in Type II error. To decrease Type I error without increasing Type II error requires either increasing the sample size or accepting that only larger treatment differences can be declared significant.

The concept of critical P-values is presented in many statistics texts, is used by many statisticians, and is accepted in most scientific journals. The choice of any particular P-value as a “critical value” is, however, purely arbitrary. Historically, obtaining exact probability values was difficult. Therefore, most statistical texts included tables with values of the probability of obtaining larger z, t, and F statistics. Although some texts had slightly more extensive tables, many contained values for only the 1% and 5% levels because of the expense of large tables.

The concept of always using the same probability value as a “cut-off” level is not only arbitrary, but also oversimplifies a complex issue. Researchers should choose a level of probability based both upon the seriousness of failing to reject the hypothesis when it is false and the probability of rejecting it when it is true (Steel and Torrie, 1980). For example, when comparing the efficacy of two similarly priced plant growth regulators, a liberal test (possibly the 20% level of significance) may be desirable because the consequences of incorrectly declaring one product superior to the other are not serious.
However, if one product were four times more expensive, a more conservative P-value would reduce the likelihood of incorrectly declaring the more expensive product superior to the less expensive product.

One may also want to consider the sample size and the amount of variation inherent in some types of data. The 5% probability level is probably too conservative for experiments where large sample sizes are not practical or are too expensive. Some types of data are quite variable and would require extraordinarily large sample sizes to detect differences at the 5% level of significance. For example, >30 peach trees per treatment are required to detect a 15% difference in yield at the 5% level of significance (Marini, 1983). Therefore, if a 10% difference in yield is considered economically important, using a more liberal P-value may be justified.

Several years ago, Wehner and Shaw (1994) suggested that authors be allowed to present comprehensive analysis of variance (ANOVA) tables in ASHS publications. I agree with their recommendations and would like to suggest two additional changes in the way the data are presented.

1) Rather than using asterisks to indicate the arbitrarily chosen probability level at which the null hypothesis is rejected, present the exact P-values. While perusing the last six issues of HortScience published in 1997 (Vol. 32), I noticed that P-values were presented in only five of 98 articles. Modern statistical software packages provide these values by default. Presentation of P-values allows readers to develop their own interpretation of the data. I am not a gambler, but as an extension specialist I am willing to recommend a new inexpensive practice that increases yield by 15% even when there is a 20% (P = 0.20) probability that the yield increase was not due to the new practice.

2) Publish data that are significant at some
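The sample-size argument made in the abstract can be illustrated with a rough calculation. This is only a sketch under stated assumptions: it uses the usual normal-approximation formula for comparing two means, n = 2(z₁₋α/₂ + z₁₋β)²(CV/d)², with the difference d and the coefficient of variation CV both expressed as percentages of the mean. The function name `trees_per_treatment`, the 20% CV, and the 80% power target are hypothetical choices, not values taken from the article or from Marini (1983).

```python
from math import ceil
from statistics import NormalDist

def trees_per_treatment(pct_diff, cv_pct, alpha, power=0.80):
    """Approximate replicates per treatment needed to detect `pct_diff`.

    Normal approximation for a two-sided comparison of two means.
    `pct_diff` and `cv_pct` are both percentages of the mean, so their
    ratio equals sd / difference. All inputs here are illustrative.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # critical value controlling Type I error
    z_beta = z.inv_cdf(power)               # quantile controlling Type II error (power = 1 - beta)
    n = 2.0 * ((z_alpha + z_beta) * cv_pct / pct_diff) ** 2
    return ceil(n)                          # round up to whole replicates

# Hypothetical coefficient of variation of 20% for yield
for alpha in (0.05, 0.10, 0.20):
    for pct_diff in (15.0, 10.0):
        n = trees_per_treatment(pct_diff, cv_pct=20.0, alpha=alpha)
        print(f"alpha={alpha:.2f}  difference={pct_diff:.0f}%  n per treatment={n}")
```

Because the normal approximation ignores the extra uncertainty from estimating the variance, a t-based calculation would call for slightly more replicates. The qualitative pattern is the point: smaller differences demand many more replicates, and a more liberal α reduces the number required.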